1 Data visualization

1.1 Load clean data

Load the cleaned data produced by the previous steps in the data_preparation.rmd file.

koi_data <- readRDS("data/Rdas/koi_data.Rda")

1.2 Correlation matrix

Create a correlation matrix to understand the relationships between variables.

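# Attach the packages this block relies on (assumed not loaded by an earlier setup chunk)
library(dplyr)
library(tidyr)
library(ggcorrplot)
# Select only numeric columns for correlation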
numerical_cols <- koi_data %>%
  select(
    koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
    koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
    koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
    koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
  ) %>%
  drop_na()
# Calculate the correlation matrix
cor_matrix <- cor(numerical_cols)
# Visualize the correlation matrix
ggcorrplot(cor_matrix,
  hc.order = TRUE, # Hierarchical clustering
  type = "upper", # Show upper triangle
  lab = TRUE, # Show correlation coefficients
  lab_size = 3, # Adjust label size
  method = "circle", # Use circles to represent correlation
  colors = c("#6D9EC1", "white", "#E46726") # Color scheme: low, mid, high
)

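The correlation matrix shows strong relationships between several variables. For example, koi_period and koi_sma have a correlation close to 1, indicating a very strong positive relationship: as the orbital period increases, the semi-major axis increases with it, as expected from Kepler's third law. The PCA loadings in section 1.3.2 reflect the same redundancy, since koi_period and koi_sma receive nearly identical weights on the leading components, whereas koi_duration loads quite differently.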
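
Reading the strongest pairs off a 20 x 20 plot is error-prone, so it can help to list them programmatically. The following is a minimal sketch and not part of the original analysis: `top_correlations()` is a hypothetical helper, and the built-in mtcars data stands in for numerical_cols so the example runs on its own.

```r
# Hypothetical helper: list variable pairs ordered by absolute correlation.
top_correlations <- function(cor_mat, n = 5) {
  # Keep each pair once by taking the upper triangle only
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first, regardless of sign
  pairs <- pairs[order(abs(pairs$r), decreasing = TRUE), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# mtcars is a stand-in; with the real data, pass cor_matrix instead
demo_cor <- cor(mtcars)
top_pairs <- top_correlations(demo_cor, n = 3)
print(top_pairs)
```

With the KOI data, calling `top_correlations(cor_matrix)` should surface the near-duplicate variables that are worth dropping or combining before modeling.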

1.3 PCA analysis

Perform PCA on the selected numerical variables.

numerical_pca_cols <- koi_data %>%
  select(
    koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
    koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
    koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
    koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
  )

disposition_col <- koi_data$koi_pdisposition
pca_data_complete <- numerical_pca_cols %>% drop_na()
disposition_complete <- disposition_col[complete.cases(numerical_pca_cols)]

if (length(disposition_complete) != nrow(pca_data_complete)) {
  stop("Mismatch between data rows and disposition labels after handling NAs.")
}

# Scale the Data (Standardize)
scaled_pca_data <- scale(pca_data_complete)
pca_result <- prcomp(scaled_pca_data, center = FALSE, scale. = FALSE)
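
Standardizing with scale() and then calling prcomp() with centering and scaling switched off should give the same result as letting prcomp() standardize internally. The sketch below checks this on a small stand-in dataset (the built-in mtcars, an assumption made only so the example is self-contained); the object names are illustrative.

```r
# mtcars is a stand-in for pca_data_complete so the sketch is self-contained
demo_data <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Approach used in this section: standardize first, then run prcomp() as-is
pca_manual <- prcomp(scale(demo_data), center = FALSE, scale. = FALSE)

# Alternative: let prcomp() center and scale internally
pca_builtin <- prcomp(demo_data, center = TRUE, scale. = TRUE)

# Standard deviations of the components should agree
sdev_match <- isTRUE(all.equal(pca_manual$sdev, pca_builtin$sdev))

# Loadings should agree too; abs() guards against arbitrary sign flips
rotation_match <- isTRUE(all.equal(
  abs(pca_manual$rotation),
  abs(pca_builtin$rotation)
))

print(c(sdev = sdev_match, rotation = rotation_match))
```

One practical difference: the internal route stores the centering and scaling values in the prcomp object, which is convenient if new observations need to be projected later with predict().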

1.3.1 PCA Summary

Show the proportion of variance explained by each component.

summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8409 1.7355 1.6688 1.5467 1.24685 1.12924 1.09109
## Proportion of Variance 0.1694 0.1506 0.1393 0.1196 0.07773 0.06376 0.05952
## Cumulative Proportion  0.1694 0.3200 0.4593 0.5789 0.65663 0.72039 0.77992
##                           PC8     PC9   PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.9317 0.83983 0.8246 0.6914 0.66262 0.62824 0.52575
## Proportion of Variance 0.0434 0.03527 0.0340 0.0239 0.02195 0.01973 0.01382
## Cumulative Proportion  0.8233 0.85858 0.8926 0.9165 0.93844 0.95817 0.97199
##                           PC15    PC16    PC17    PC18    PC19    PC20
## Standard deviation     0.45004 0.41454 0.33987 0.23551 0.10610 0.05948
## Proportion of Variance 0.01013 0.00859 0.00578 0.00277 0.00056 0.00018
## Cumulative Proportion  0.98212 0.99071 0.99649 0.99926 0.99982 1.00000
fviz_eig(pca_result, addlabels = TRUE)

From the eigenvalues, we can see that the first two principal components explain only about 32% of the total variance, so a two-dimensional projection captures a limited share of the variability in the data. The first 11 principal components are needed to exceed 90% of the variance (about 91.7% cumulative), suggesting that the underlying structure of the data (based on these numerical variables) is quite complex. There isn’t a simple, low-dimensional linear subspace that captures most of the information.
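
The component count can be derived directly from the standard deviations instead of being read off the table. A minimal sketch, using the values printed by summary(pca_result) above; the 90% threshold and the variable names are illustrative choices:

```r
# Standard deviations copied from the summary(pca_result) output above
sdev <- c(
  1.8409, 1.7355, 1.6688, 1.5467, 1.24685, 1.12924, 1.09109,
  0.9317, 0.83983, 0.8246, 0.6914, 0.66262, 0.62824, 0.52575,
  0.45004, 0.41454, 0.33987, 0.23551, 0.10610, 0.05948
)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)

# Illustrative threshold: smallest number of components reaching 90%
threshold <- 0.90
n_components <- which(cum_var >= threshold)[1]

print(n_components)
print(round(cum_var[c(2, n_components)], 3))
```

With the fitted object available, `pca_result$sdev` can replace the hard-coded vector.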

1.3.2 PCA Loadings

Show how the original variables contribute to each PC using the rotation matrix. The loadings tell us how much each original variable contributes to each principal component. Larger absolute values mean stronger influence. The sign (+/-) indicates the direction of the correlation between the variable and the component.

print(pca_result$rotation)
##                        PC1          PC2          PC3          PC4         PC5
## koi_period      0.30639694  0.347714363  0.274495160 -0.008366897  0.01847917
## koi_duration    0.04269443  0.231972796  0.115843053 -0.015138855 -0.07234090
## koi_depth      -0.12450017  0.127549414 -0.052461072 -0.140470095 -0.62521527
## koi_prad       -0.15286222  0.058554951  0.120548664 -0.435568096  0.11140140
## koi_teq        -0.42635722 -0.120328101  0.183977631  0.122937219  0.02213583
## koi_insol      -0.14607512 -0.123562878  0.304706726  0.046459281 -0.09016303
## koi_model_snr  -0.11924810  0.147469117 -0.048934105 -0.089975397 -0.62370781
## koi_steff      -0.26437111  0.362258812 -0.072322603  0.209410982  0.07667573
## koi_slogg       0.25605738 -0.025127674 -0.435100454 -0.115861946  0.02756831
## koi_srad       -0.17033395 -0.129612484  0.444175616  0.034644448 -0.08661357
## koi_smass      -0.29636009  0.203438378  0.260224242  0.222440536  0.09382075
## koi_impact     -0.19050425  0.118816716  0.001398144 -0.508153075  0.22162035
## koi_ror        -0.17094262  0.131493527  0.009384989 -0.548172046  0.04619732
## koi_srho        0.06774909  0.066623582  0.081594434 -0.177212392  0.11099283
## koi_sma         0.30796798  0.361668040  0.286202772  0.001518795  0.01479373
## koi_incl        0.25332897 -0.005414288  0.049045471  0.057546845 -0.23841575
## koi_dor         0.29903665  0.280085628  0.250674161 -0.041381312  0.02445484
## koi_ldm_coeff1  0.21466163 -0.409881812  0.244745270 -0.174726148 -0.08714585
## koi_ldm_coeff2 -0.16513550  0.376579150 -0.276071465  0.151030824  0.10000747
## koi_smet        0.06242363 -0.095393670  0.123653316  0.091883359  0.17810853
##                         PC6          PC7         PC8         PC9        PC10
## koi_period      0.119360611  0.051875537 -0.06019383 -0.02002505 -0.18962741
## koi_duration    0.570839165 -0.215695497  0.20280440  0.23681782  0.55006942
## koi_depth      -0.068584639 -0.079795012  0.11874165  0.06465849 -0.13349936
## koi_prad       -0.131240154 -0.047792214 -0.09561726 -0.12698835  0.32123182
## koi_teq        -0.019868440  0.150918255  0.13576536 -0.03384746 -0.22829643
## koi_insol       0.041738202  0.369541809 -0.31817638  0.64825953 -0.02700493
## koi_model_snr  -0.054104702 -0.108447734  0.11403543  0.07830158 -0.08374526
## koi_steff      -0.126839948 -0.122390073 -0.02680701  0.03044942 -0.06753539
## koi_slogg       0.006172822  0.078111944 -0.06601230  0.34949241 -0.13511190
## koi_srad        0.038663393  0.194322150 -0.15544735 -0.03160191  0.15628893
## koi_smass      -0.160901331 -0.307606379  0.06812431 -0.12983491 -0.02450837
## koi_impact      0.093935585 -0.062617207 -0.13387480  0.03102445 -0.17452222
## koi_ror        -0.050116478 -0.081455459 -0.18216056  0.02736957 -0.09446960
## koi_srho       -0.487708565  0.284586266  0.61362328  0.20829062  0.32003174
## koi_sma         0.097524343 -0.007427604 -0.09031723 -0.02579694 -0.09548265
## koi_incl       -0.460956458 -0.064815858 -0.51388871 -0.15445623  0.37733221
## koi_dor        -0.175128087  0.185140235  0.14716568 -0.01272897 -0.31841721
## koi_ldm_coeff1  0.066040003 -0.172049829  0.13495052 -0.07410840 -0.09869411
## koi_ldm_coeff2 -0.062357540  0.202944864 -0.15945606  0.12053377  0.13967629
## koi_smet       -0.281578572 -0.645526832 -0.01668499  0.51474574 -0.09385791
##                       PC11         PC12        PC13         PC14         PC15
## koi_period     -0.11820578  0.061591783 -0.01431356  0.076013208  0.459993894
## koi_duration    0.12960162  0.109252424  0.00432354  0.066515371 -0.103379379
## koi_depth      -0.04267725  0.396045760  0.54672910 -0.064810340 -0.012008518
## koi_prad       -0.73163329  0.182306235 -0.15406092 -0.005183395 -0.095637702
## koi_teq        -0.12060014  0.185245607 -0.00728660  0.303339231  0.376869420
## koi_insol       0.04266643  0.177584952 -0.20029275  0.179851481 -0.183178017
## koi_model_snr  -0.09554845 -0.510837854 -0.49530116  0.042730159  0.058988510
## koi_steff       0.10671698  0.361573273 -0.35262834 -0.486478946  0.016285554
## koi_slogg      -0.11500813  0.186729780 -0.12435748 -0.334522107  0.120528755
## koi_srad       -0.01654610 -0.328290690  0.26449586 -0.659210103  0.102248018
## koi_smass       0.15456549  0.084960880 -0.08450251  0.080795833 -0.200376338
## koi_impact      0.32277566 -0.149681992 -0.01361718 -0.031637396  0.047790992
## koi_ror         0.28721255  0.004764452  0.09098955  0.080315291  0.004165243
## koi_srho        0.20516679 -0.005572167 -0.02150910 -0.021955538  0.210846592
## koi_sma        -0.04104112  0.017431854 -0.02271675  0.016874341  0.262161956
## koi_incl        0.25593040  0.131709318 -0.05710512  0.143027074  0.077333599
## koi_dor        -0.07984545 -0.055057826  0.05838449 -0.012100698 -0.627522154
## koi_ldm_coeff1  0.06933213  0.130826127 -0.13079360 -0.004982672 -0.011208722
## koi_ldm_coeff2 -0.13521739 -0.297682322  0.30880508  0.202161085  0.004390644
## koi_smet       -0.18008029 -0.194697850  0.21481309  0.003904580  0.092571406
##                       PC16         PC17         PC18          PC19
## koi_period      0.03802923 -0.029875167 -0.021412450  6.437068e-01
## koi_duration   -0.29905083 -0.103337964  0.096615121  2.303355e-03
## koi_depth       0.16802341  0.069849268  0.112401644  1.515136e-03
## koi_prad        0.06895291  0.002454426  0.048781159  2.186547e-03
## koi_teq        -0.52158913 -0.189546320  0.179027991 -1.908513e-01
## koi_insol       0.22205127  0.087711186 -0.042019739  4.009233e-02
## koi_model_snr  -0.03694752 -0.039322810  0.004593793 -4.513809e-03
## koi_steff      -0.16381831  0.350713071 -0.051366675  1.855004e-02
## koi_slogg      -0.07264963 -0.607563323  0.095505260 -7.671468e-02
## koi_srad       -0.09304371 -0.165404261 -0.006564376  8.813710e-03
## koi_smass       0.38913201 -0.600489892  0.005196951  7.156391e-02
## koi_impact      0.11422003  0.096604457  0.653988234 -8.810851e-05
## koi_ror        -0.24061955 -0.120915818 -0.654949610 -1.554008e-03
## koi_srho        0.10595934  0.023054427 -0.010784763  1.216639e-03
## koi_sma         0.22260594  0.040643864 -0.077415692 -7.320998e-01
## koi_incl       -0.24358061 -0.075401764  0.220602020 -2.777929e-03
## koi_dor        -0.38968330 -0.050942658  0.132763452  2.187147e-03
## koi_ldm_coeff1  0.03353477 -0.044421110 -0.002719654 -7.414143e-03
## koi_ldm_coeff2  0.01801750 -0.069436056  0.011572288 -8.901582e-03
## koi_smet       -0.12982706  0.130710575  0.017994995 -5.298896e-03
##                         PC20
## koi_period      6.044490e-03
## koi_duration    5.976220e-04
## koi_depth      -1.064006e-03
## koi_prad        1.674146e-04
## koi_teq         2.041897e-03
## koi_insol      -1.814064e-03
## koi_model_snr   5.295556e-03
## koi_steff       2.245508e-01
## koi_slogg      -1.346323e-05
## koi_srad        1.387013e-02
## koi_smass      -9.313680e-03
## koi_impact      5.237360e-03
## koi_ror        -5.471374e-03
## koi_srho        3.722707e-04
## koi_sma        -5.138471e-03
## koi_incl       -4.704890e-04
## koi_dor        -2.463456e-04
## koi_ldm_coeff1  7.603596e-01
## koi_ldm_coeff2  6.073264e-01
## koi_smet       -4.634635e-02

Visualize Loadings for PC1 and PC2

print("Loadings Plot for PC1 vs PC2:")
## [1] "Loadings Plot for PC1 vs PC2:"
fviz_pca_var(pca_result,
  col.var = "contrib", # Color by contributions
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE
)

Analysis of the component loadings revealed distinct patterns captured by the principal components.

  • PC1 (~17% Var): Seems to represent a contrast between orbital size/period and temperature. It has high positive loadings for koi_period, koi_sma, koi_dor (larger orbits) and high negative loadings for koi_teq (cooler temperatures associated with larger orbits). Stellar properties (koi_slogg, koi_steff, koi_smass) also contribute moderately.
  • PC2 (~15% Var): Also strongly related to orbital size/period (positive loadings for koi_period, koi_sma, koi_dor) but also strongly incorporates stellar temperature (koi_steff positive loading) and limb darkening (koi_ldm_coeff1 negative, koi_ldm_coeff2 positive).
  • PC3 (~14% Var): Primarily related to stellar properties, contrasting stellar radius/insolation (koi_srad, koi_insol positive) with stellar surface gravity (koi_slogg negative). Orbital size variables also contribute moderately.
  • PC4 (~12% Var): Dominated by relative planet size and transit geometry, with high negative loadings for koi_prad, koi_ror (planet/star radius ratio), and koi_impact.
  • PC5 (~8% Var): Represents the transit signal strength, dominated by high negative loadings for koi_depth and koi_model_snr.
  • Later PCs: Capture more nuanced relationships. PC6 relates transit duration and stellar density (koi_duration, koi_srho). PC7 involves insolation and metallicity (koi_insol, koi_smet). PC19/PC20 seem to isolate specific period/axis relationships and limb darkening effects.

These interpretations suggest that the primary sources of variation in the dataset relate to the transit signal strength, stellar characteristics, transit geometry, and orbital properties.
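
Statements such as "dominated by" can be made quantitative by squaring the loadings. Because each column of the rotation matrix has unit length, the squared loadings of a component sum to 1 and can be read as percentage contributions. The sketch below does this for PC5, using the values printed above; the variable names are illustrative. The `contrib` coloring in the loadings plot combines the two plotted components, so its values will differ from this single-component calculation.

```r
# PC5 loadings copied from the print(pca_result$rotation) output above
pc5_loadings <- c(
  koi_period = 0.01847917, koi_duration = -0.07234090,
  koi_depth = -0.62521527, koi_prad = 0.11140140,
  koi_teq = 0.02213583, koi_insol = -0.09016303,
  koi_model_snr = -0.62370781, koi_steff = 0.07667573,
  koi_slogg = 0.02756831, koi_srad = -0.08661357,
  koi_smass = 0.09382075, koi_impact = 0.22162035,
  koi_ror = 0.04619732, koi_srho = 0.11099283,
  koi_sma = 0.01479373, koi_incl = -0.23841575,
  koi_dor = 0.02445484, koi_ldm_coeff1 = -0.08714585,
  koi_ldm_coeff2 = 0.10000747, koi_smet = 0.17810853
)

# Squared loadings sum to (approximately) 1, so scale them to percentages
contrib_pc5 <- 100 * pc5_loadings^2

# The three variables that contribute most to PC5
top_pc5 <- head(sort(contrib_pc5, decreasing = TRUE), 3)
print(round(top_pc5, 1))
```

On these numbers, koi_depth and koi_model_snr together account for roughly 78% of PC5, which supports reading it as a transit signal strength component.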

1.3.3 PCA Plots

Combine PCA results with the disposition information and plot the results.

pca_plot_data <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  Disposition = disposition_complete
)

autoplot(pca_result,
  data = data.frame(pca_data_complete, Disposition = disposition_complete), colour = "Disposition",
  loadings = TRUE, loadings.colour = "blue",
  loadings.label = TRUE, loadings.label.size = 3
) +
  labs(title = "PCA Plot with Loadings") +
  theme_minimal()

fviz_pca_ind(pca_result,
  geom.ind = "point", # show points only (but can use "text")
  col.ind = disposition_complete, # color by groups
  palette = "jco", # Journal color palette
  addEllipses = TRUE, # Concentration ellipses
  legend.title = "Disposition"
) +
  ggtitle("PCA Plot of Individuals")

pca_scores_df_7 <- data.frame(pca_result$x[, 1:7], Disposition = disposition_complete)

ggpairs(pca_scores_df_7,
  columns = 1:7, # Specify columns for the PC dimensions
  aes(color = Disposition, alpha = 0.6), # Color by Disposition; alpha is a constant transparency level
  upper = list(continuous = wrap("cor", size = 3)), # Show correlation in upper panels
  lower = list(continuous = wrap("points", size = 1)), # Show scatter plots in lower panels
  diag = list(continuous = wrap("densityDiag", alpha = 0.5)), # Show density plots on diagonal
  title = "Pairs Plot Matrix of First 7 Principal Components"
) +
  theme_minimal() + # Apply a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

1.4 Visualization

ggplot(
  koi_data %>% filter(!is.na(koi_impact), !is.na(koi_duration), !is.na(koi_pdisposition)),
  aes(x = koi_impact, y = koi_duration, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(
    title = "Impact Parameter vs. Transit Duration",
    x = "Impact Parameter (koi_impact)",
    y = "Transit Duration [hours] (koi_duration)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal()

ggplot(
  koi_data %>% filter(!is.na(koi_impact), !is.na(koi_depth), !is.na(koi_pdisposition)),
  aes(x = koi_impact, y = koi_depth, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_y_log10() + # Depth often varies widely
  labs(
    title = "Impact Parameter vs. Transit Depth",
    x = "Impact Parameter (koi_impact)",
    y = "Transit Depth [ppm] (koi_depth) (log scale)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal()

ggplot(
  koi_data %>% filter(!is.na(koi_smet), !is.na(koi_prad), !is.na(koi_pdisposition)),
  aes(x = koi_smet, y = koi_prad, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_y_log10() + # Planet radius often plotted on log scale
  labs(
    title = "Stellar Metallicity vs. Planetary Radius",
    x = "Stellar Metallicity [Fe/H] (koi_smet)",
    y = "Planetary Radius [Earth Radii] (koi_prad) (log scale)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal()

ggplot(koi_data %>% filter(!is.na(koi_prad)), aes(x = koi_prad)) +
  geom_histogram(binwidth = 0.1) + # Width is in log10 units because of scale_x_log10()
  scale_x_log10() +
  labs(title = "Distribution of Planetary Radii", x = "Planetary Radius [Earth Radii] (log scale)", y = "Count")

ggplot(koi_data %>% filter(!is.na(koi_period)), aes(x = koi_period)) +
  geom_histogram(bins = 30) + # ggplot2's default bin count, set explicitly; adjust as needed
  scale_x_log10() +
  labs(title = "Distribution of Orbital Periods", x = "Orbital Period [Days] (log scale)", y = "Count")
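
When a histogram is drawn on a log10 axis, ggplot2 bins the data after the scale transformation, so a bin width of 0.1 means 0.1 in log10 units, not 0.1 Earth radii. The base R sketch below illustrates what such bins look like on the original scale. The radii are simulated values (an assumption made only for illustration), and all names are hypothetical.

```r
# Simulated radii (hypothetical values, log-normally distributed)
set.seed(42)
radii <- 10^rnorm(500, mean = 0.4, sd = 0.5)

# Bin edges spaced 0.1 apart in log10 space
log_r <- log10(radii)
edges_log <- seq(
  floor(min(log_r) * 10) / 10,
  ceiling(max(log_r) * 10) / 10,
  by = 0.1
)
counts <- table(cut(log_r, breaks = edges_log, include.lowest = TRUE))

# On the original scale the bins grow geometrically: each edge is
# 10^0.1 (about 1.26) times the previous one
edges_linear <- 10^edges_log
ratio <- edges_linear[-1] / edges_linear[-length(edges_linear)]

print(round(range(ratio), 3))
print(sum(counts))
```

This is why a fixed width that looks fine for small planets does not produce increasingly narrow bars for large ones: every bin covers the same multiplicative range.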

Period vs. Radius: A classic plot in exoplanet studies. Color by disposition.

ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_period)),
  aes(x = koi_period, y = koi_prad, color = koi_disposition)
) +
  geom_point(alpha = 0.5, size = 1.5) + # Adjust alpha/size
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Orbital Period vs. Planetary Radius",
    x = "Orbital Period [Days] (log scale)",
    y = "Planetary Radius [Earth Radii] (log scale)",
    color = "Disposition"
  ) +
  theme_minimal() # Or other themes

Insolation/Temperature vs. Radius: Explore potential atmospheric regimes.

ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_insol)),
  aes(x = koi_insol, y = koi_prad, color = koi_disposition)
) +
  geom_point(alpha = 0.5) +
  scale_x_log10() + # Insolation often spans orders of magnitude
  scale_y_log10() +
  labs(
    title = "Insolation Flux vs. Planetary Radius",
    x = "Insolation Flux [Earth Flux] (log scale)",
    y = "Planetary Radius [Earth Radii] (log scale)",
    color = "Disposition"
  )

Stellar Temperature vs. Stellar Radius/Mass: Explore stellar properties (like an H-R diagram).

ggplot(
  koi_data %>% filter(!is.na(koi_steff), !is.na(koi_srad)),
  aes(x = koi_steff, y = koi_srad)
) +
  geom_point(alpha = 0.3) +
  scale_x_reverse() + # Convention for H-R diagrams
  scale_y_log10() +
  labs(
    title = "Stellar Properties (H-R Diagram Analog)",
    x = "Stellar Effective Temperature [K]",
    y = "Stellar Radius [Solar Radii] (log scale)"
  )

Use boxplots or violin plots to compare distributions between disposition categories.

# Compare Transit SNR for different dispositions
ggplot(
  koi_data %>% filter(!is.na(koi_model_snr)),
  aes(x = koi_disposition, y = koi_model_snr, fill = koi_disposition)
) +
  geom_boxplot() + # Or geom_violin()
  scale_y_log10() + # If SNR varies widely
  labs(
    title = "Transit Signal-to-Noise by Disposition",
    x = "Disposition", y = "Transit SNR (log scale)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Improve label readability

See how parameters differ when specific flags are raised.

# Compare transit depth for objects flagged/not flagged as stellar eclipses (SS)
koi_data %>%
  filter(!is.na(koi_depth)) %>%
  mutate(ss_flag = as.factor(koi_fpflag_ss)) %>% # Make flag a factor for plotting
  ggplot(aes(x = ss_flag, y = koi_depth, fill = ss_flag)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(
    title = "Transit Depth Comparison for Stellar Eclipse Flag",
    x = "Stellar Eclipse Flag (koi_fpflag_ss)",
    y = "Transit Depth [ppm] (log scale)"
  )